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Abstract 

Background: Chromatin immunoprecipitation followed by next generation sequencing (ChlP-seq), enables 
unbiased and genome-wide mapping of protein-DNA interactions and epigenetic marks. The first step in ChlP-seq 
data analysis involves the identification of peaks (i.e., genomic locations with high density of mapped sequence 
reads). The next step consists of interpreting the biological meaning of the peaks through their association with 
known genes, pathways, regulatory elements, and integration with other experiments. Although several programs 
have been published for the analysis of ChlP-seq data, they often focus on the peak detection step and are usually 
not well suited for thorough, integrative analysis of the detected peaks. 

Results: To address the peak interpretation challenge, we have developed ChlPseeqer, an integrative, 
comprehensive, fast and user-friendly computational framework for in-depth analysis of ChlP-seq datasets. The 
novelty of our approach is the capability to combine several computational tools in order to create easily 
customized workflows that can be adapted to the user's needs and objectives. In this paper, we describe the main 
components of the ChlPseeqer framework, and also demonstrate the utility and diversity of the analyses offered, 
by analyzing a published ChlP-seq dataset. 

Conclusions: ChlPseeqer facilitates ChlP-seq data analysis by offering a flexible and powerful set of computational 
tools that can be used in combination with one another. The framework is freely available as a user-friendly GUI 
application, but all programs are also executable from the command line, thus providing flexibility and 
automatability for advanced users. 



Background 

The use of chromatin immunoprecipitation in combina- 
tion with high-throughput sequencing (ChlP-seq) has 
enabled the study of genome-wide mapping of protein- 
DNA interaction and epigenetic marks. By sequencing 
millions of immunoprecipitated DNA fragments in a sin- 
gle experiment, ChlP-seq outperforms the array-based 
ChlP-chip (Chromatin Immunoprecipitation followed by 
DNA microarray hybridization) technology in terms of 
quality, specificity, and coverage [1-3], and has the poten- 
tial to greatly improve our understanding of the mechan- 
isms underlying transcriptional regulation [4-8]. Many 
peak detection methodologies and software tools have 
been developed for the analysis of ChlP-seq data since the 
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introduction of the technology [2,3,9,10]. Although peak 
detection is important for the analysis of ChlP-seq data, it 
is only the first step. Additional computational tools are 
needed to help interpret the genome-wide transcription 
factor binding and histone mark enrichment patterns 
revealed by peak detection procedures. We have developed 
ChlPseeqer, a comprehensive computational framework 
that enables broad, but also in-depth, analysis of ChlP-seq 
data. The framework includes: (1) gene-level annotation of 
peaks, (2) pathways enrichment analysis, (3) regulatory 
element analysis, using either a de novo approach, known 
or user-defined motifs, (4) nongenic peak annotation 
(repeats, CpG islands, duplications, published ChlP-seq 
datasets), (5) conservation analysis, (6) clustering analysis, 
(7) visualization tools, (8) integration and comparison 
across different ChlP-seq experiments. These components 
share a common architecture: they take as input a set of 
ChlP-seq peaks, perform the defined analysis, and output 
one or more sets of peaks that can be used by any of the 
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other tools provided. Thus, our framework was designed 
to offer the flexibility required to perform complex ana- 
lyses by defining custom workflows. In principle, ChlPsee- 
qer can use peaks generated by any peak-calling program 
as initial input, assuming these peaks are in a simple and 
standard format (i.e., chromosome, start position, end 
position). For convenience, ChlPseeqer also includes its 
own fast and accurate peak finding algorithm, previously 
evaluated and compared with other algorithms in Qin et 
al. [11]. Details and additional validation of our peak 
detection algorithm are provided in Additional file 1. By 
analyzing a published ChlP-seq dataset, we show how the 
modular nature of the framework can help the users create 
easily computational pipelines and address sophisticated 
biological questions. 

Comparison with related work 

Although numerous approaches already exist for the ana- 
lysis of ChlP-seq data, many of them focus on the peak 
detection process and provide few or no tools for the 
interpretation of ChlP-seq peaks [5-8,12-19]. Several fra- 
meworks for interpreting ChlPseq datasets have nonethe- 
less been developed, all of which have strengths and 
limitations. For example, the Cistrome project (unpub- 
lished, [20]) integrates several analysis tools, such as gene 
annotation, motif and pathways analysis, and peak con- 
servation. However, users must upload their ChlP-seq 
data to the Cistrome server, which will be increasingly 
time-consuming and less practical as the size of ChlP-seq 
datasets increases. Moreover, feeding the results of an 
analysis directly into another tool within the same frame- 
work can be difficult for some users because each tool 
supports different input formats. Galaxy [21-25], which is 
used as Cistrome's backend, is a powerful framework but 
possibly too general for the analysis of ChlP-seq data. In 
particular, Galaxy does not allow control over several 
important parameters in ChlP-seq data analysis, such as 
the maximum or minimum distance between the peak 
and its closest gene in the peak-gene association task 
[26]. CisGenome [27] also supports tools for the inte- 
grated analysis of ChlP-seq data, such as annotation of 
peaks with their neighbor genes, conservation analysis, 
and motifs discovery. However, CisGenome does not 
allow the correlation of peaks and their target genes with 
Gene Ontology (GO) terms and pathways that could 
shed light on the biological processes and pathways con- 
trolled by the transcription factors (TFs) or histone mod- 
ifications assayed by ChlP-seq. The GPAT program [26] 
provides systematic annotation of genomic positions in 
general. It uses gene annotation from different public 
databases, such as RefSeq and Ensembl, and provides 
access to the expression status of the corresponding 
genes from existing transcriptomic databases, or user- 
generated expression datasets. The limitation of GPAT is 



the lack of other tools for the analysis of ChlP-seq data, 
apart from genomic annotation. Thus, GPAT users can- 
not perform motif discovery, pathways enrichment, and 
other analyses that are useful in the ChlP-seq context. 
EpiChIP [28] offers gene-based enrichment analysis of 
ChlP-seq datasets. In particular, EpiChIP looks for 
enrichment of the ChlP-seq reads over the control sam- 
ple in specific regions of the genes, such as the 5'- or 3'- 
end, exons or introns. This approach has the advantage 
of identifying directly the genes that are enriched in the 
TF or the histone modification of the reference dataset. 
However, the program lacks in providing further annota- 
tion of the enriched regions. The seqMINER platform 
[29] aims at integrating and comparing different ChlP- 
seq datasets in terms of read density. The algorithm first 
estimates for a set of genomic regions (i.e., reference 
dataset) the read density of multiple ChlP-seq datasets. 
Clustering and visualization methods are then provided 
to show groups of regions with similar binding features. 
Although this approach is useful to integrate ChlP-seq 
datasets, it focuses on the comparison of read density 
profiles and does not integrate other sources of informa- 
tion, such as the motifs and pathways enrichment or the 
level of conservation. HOMER [30] provides a suite of 
programs originally developed for motif discovery, and 
later for ChlP-seq peak detection. Although it includes 
tools for gene annotation, clustering, and visualization of 
the peaks (e.g., histograms, heatmaps), it does not sup- 
port conservation analysis and can only run from the 
command-line. BEDTools [31] is a UNIX-based collec- 
tion of utilities that allow common operations on geno- 
mic features in general (e.g., find overlaps between two 
files with genomic intervals, extract FASTA sequences 
from genomic intervals). Although the BEDTools are 
designed to provide fast solutions to basic operations on 
large data volumes produced by DNA sequencing, they 
do not offer computational tools for the functional inter- 
pretation of ChlP-seq peaks (e.g., motifs and pathways 
analysis). Their command-line nature also demands extra 
effort and computer skills from users. CEAS [32] is a 
stand-alone extension of a web application previously 
developed for ChlP-chip data [33], but is also offered 
through the Cistrome framework. The tool provides 
basic annotation tools for ChlP-seq data, such as the esti- 
mation of peaks distribution across the genome, identifi- 
cation of genes associated with peaks by proximity, and 
more. However, one drawback of CEAS, when using it 
through the Cistrome application, is that it produces gra- 
phical representation of the results and does not output 
lists of peaks that belong to specific genomic categories 
(e.g., promoters, introns). PeakAnalyzer [34] can subdi- 
vide ChlP-seq peaks that have multiple sites of enrich- 
ment into smaller peaks; this procedure may facilitate 
more detailed analysis of individual subpeaks. It also 
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offers annotation functions, and can locate the nearest 
downstream genes and transcription start sites for each 
peak. It can also determine overlapping peaks between 
different datasets. Although PeakAnalyzer allows the user 
to perform gene annotation, motifs analysis, annotation 
with functional elements, and comparisons across data- 
sets, it does not support the functional interpretation of 
the ChlPseq results through their association with path- 
ways. On the other hand, GREAT [35] is a web applica- 
tion that supports the analysis of functional significance 
of ChlP-seq peaks using 20 different information sources 
(e.g., Gene Ontology, PANTHER pathway, Pathway 
Commons, InterPro). Importantly, the tool integrates not 
only proximal but also distal binding events to obtain a 
gene-based p-value for enrichment [35]. However, 
GREAT does not offer an automated way to retrieve lists 
of genes and peaks associated with a specific pathway, 
Gene Ontology term, or motif. This feature (provided in 
our framework) would enable users to perform further 
and more targeted analysis on subsets of the initial peaks, 
which were found to be functionally significant. In con- 
trast, ChlPseeqer has many advantages compared to 
these programs. First, it offers a variety of tools that 
cover not only basic gene annotation, but also a wide 
range of computational analyses, including motif analysis, 
pathways enrichment, estimation of conservation, read 
density analysis and more. Second, ChlPseeqer allows the 
comparison of multiple datasets based not only on read 
density profiles, but also on peak binding overlap, and 
integration with other ChlP-seq datasets. Third, the fra- 
mework provides a straightforward and effortless connec- 
tion between the tools; no data format transformation is 
needed to combine the tools and perform a comprehen- 
sive and sophisticated data analysis. Fourth, the frame- 
work allows the users to control all parameters of the 
analysis, such as the minimum distance away from tran- 
scripts, the upstream distance from transcription start 
site (TSS), the database annotation and more. In addition, 
ChlPseeqer runs locally on the user's computer enabling 
the analysis of very large datasets. Finally, it provides a 
user-friendly graphical interface that can be used effort- 
lessly even by non-expert users. 

Implementation 

Software distribution and availability 

ChlPseeqer is available as a set of standalone command 
line tools. For advanced users, command line tools pro- 
vide great flexibility and automatability. For less 
advanced users, we have made these tools available via a 
graphical user interface (GUI), developed using the 
multi-platform QT framework [36]. The bundle (i.e., 
command line tools and GUI) has been tested on Linux 
and Mac OS X. Detailed installation instructions and 
documentation for all tools included in the framework 



are also available online [37]. Our implementation is 
available as free software, released under the GNU Gen- 
eral Public License (GPL) v3 [38]. 

Comparison of genomic intervals 

Many computations performed in ChlPseeqer involve 
assessing overlaps between hundreds or thousands of 
genomic regions (i.e., peaks, transcripts/genes, gene 
parts), and therefore, efficient algorithms are needed to 
quickly determine and characterize these overlaps. In 
ChlPseeqer, fast comparison on genomic intervals is per- 
formed using interval trees [39,40], ordered tree struc- 
tures that store and index intervals with fast querying 
and processing times, and ensure efficient searching of all 
indexed intervals that overlap with any given interval or 
point. An interval tree is an augmented binary search 
tree: each node contains an interval and also stores the 
maximum endpoint of the subtree rooted at the particu- 
lar node. Apart from the insert and delete operations that 
characterize the binary search trees, interval trees also 
support a query operation that allows searching the tree 
for overlaps with a given interval. The first step of the 
algorithm sets the root of the tree as current node. The 
second step checks if the given interval overlaps with the 
current node; if not, it compares the low endpoint of the 
given interval with the maximum value stored at the left 
child of the current node. If the low endpoint of the 
interval is lower than the maximum value stored, then 
the current node is set to the left child; otherwise it is set 
to the right child. Then the algorithm goes back to the 
first step and repeats the same procedure for the new 
current node until it finds an overlap of the given interval 
with the interval stored in the current node, or until the 
whole tree is explored. Of note, we are using a modified 
implementation of the original algorithm [40], so as not 
to stop at the first overlapping interval but find all inter- 
vals that overlap with the given one. Moreover, we use a 
randomization procedure that takes into account the nat- 
ural clumping of features [41] to assess the statistical sig- 
nificance of the observed number of overlapping peaks 
between two peak files. This procedure consists of gener- 
ating many "random" lists of peaks maintaining peak 
sizes and number of peaks as well as the chromosomal 
and genomic distribution of the peaks in the first peak 
file. The latter means that each random list of peaks 
maintains the same fraction of peaks in promoter, exonic, 
intronic, downstream, and intergenic regions as the origi- 
nal peak file. Then, for each random peak list, the num- 
ber of overlapping peaks with the second peak file (kept 
unchanged) is calculated. A p-value is determined by 
counting the number of times the random overlap is 
equal to or greater than the originally observed number 
of overlapping peaks. Additionally, the z-score is esti- 
mated, representing the distance (in number of standard 
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deviations) betweem the observed number of overlapping 
peaks and the average number of overlapping peaks 
expected by chance. 

Supported formats, annotations, and species 

One of the advantages of the framework is the support of 
different formats that are well-established in deep sequen- 
cing experiments, such as SAM, BAM, eland, extended 
eland, bed and export. ChlPseeqer also provides gene- 
based annotation from multiple sources and databases, 
such as RefSeq, Ensembl, UCSC Genes, and AceView. 
Finally, four species are currently supported, namely 
Homo sapiens, Mus musculus, Drosophila melanogaster, 
and Saccharomyces cerevisiae. Support for additional spe- 
cies can be added to the framework as described in Addi- 
tional file 1 and in our online documentation. 

Results 

ChlPseeqer user-defined workflows 

ChlPseeqer is a comprehensive and fully integrated frame- 
work offering a dry-lab workbench for the processing and 
analysis of ChlP-seq data. Table 1 summarizes the basic 
tools in the framework along with a short description of 
their functionality and their availability in the ChlPseeqer 
interface. A detailed description of these tools is provided 
in the ChlPseeqer modules section. The framework 
includes a peak detection program as well as tools for per- 
forming quality control of the raw reads (see Additional 
file 1). However, the most interesting aspect of ChlPseeqer 
is the variety of independent analysis modules, all of which 
have the same structure: (1) they take as input a list (or in 
some cases lists) of peaks in a simple tab-delimited format, 
(2) perform a given analysis, and (3) output one or more 
peak lists. These modules can be used in any order, since 
their input and output are peak lists of the same format 
(i.e., chromosome, start position, end position). Thus, the 
novelty of the framework is the capability to combine 
these modules and design specific workflows that enable 
multi-step bioinformatics analyses of ChlP-seq data, 
according to the user's objectives and hypotheses. Figure 1 
shows two scenarios that combine some of the ChlPseeqer 
modules. These scenarios are indicative examples based 
on our experience in analyzing several ChlP-seq datasets, 
and others could be considered as well. For example, a 
potential workflow (see Figure 1A) involves: 

(1) running the peak detection algorithm for a TF 
(e.g., ETS) ChlP-seq dataset, 

(2) finding the peaks that have a specific motif (e.g., 
the ELK1 motif) using the ChlPseeqerMotifMatch 
module, 

(3) identifying the peaks that bind at the promoters of 
known RefSeq genes using ChlPseeqerAnnotate, and 

(4) performing pathways analysis on these genes with 
ChlPseeqeriPAGE, in order to find biological processes 



in which the given TF is likely involved. Another work- 
flow (see Figure IB) identifies putative enhancers based 
on TF and histone modifications ChlP-seq data. In that 
case, the intergenic peaks are first detected using ChlP- 
seeqerAnnotate, and then the peaks that also overlap 
with enhancer marks [42] are reported using Compar- 
elntervals. The corresponding subset of peaks represents 
putative enhancers; to discover informative regulatory 
elements within these peaks, unsupervised de novo 
motif analysis can be performed using ChlPseeqerFIRE. 
Finally, the ChlPseeqerCons module can be used to com- 
pare the conservation between putative enhancers and 
random genomic regions, in order to determine enhan- 
cers that are most likely to be functional. 

Use of ChlPseeqer - Example 

To illustrate the power and flexibility of ChlPseeqer, we 
analyzed a published ETS1 ChlP-seq [43], performed in 
Jurkat T cells. ETS1 is an oncogene [44] and member of 
the ETS family of eukaryotic transcription factors. It is 
preferentially expressed at high levels in B and T cells, and 
plays a critical role in T cell activation [45] . Recent studies 
based on chromatin immunoprecipitation have shown 
ETS1 binding events in both promoters and enhancers in 
Jurkat T cells [43,46]. 

ETS1 binds to thousands of locations and is associated 
with binding sites of other TFs 

Our peak detection algorithm identified 9,065 ETS1 
peaks. We associated these peaks with genes using the 
ChlPseeqerAnnotate module and the RefSeq annotation 
(Figure 2A). This analysis revealed a large occupancy of 
ETS1 peaks at the promoters of the genes (-67%), but 
also that ETS1 binding occurs on intergenic regions 
(-17%), at least 2 kb away from known TSS. Using the 
lists of these promoter peaks and distal peaks automati- 
cally generated by ChlPseeqerAnnotate, we performed 
"supervised" motif analysis (using known motifs) and de 
novo motif discovery. In the supervised analysis on the 
promoter peaks ChlPseeqerMotifMatch module), we 
determined subsets of ETS1 peaks that contain motif 
occurrences for other ETS family members (e.g., SPI1, 
SPIB, ELK1) [43,47,48], using motif weight matrices from 
the JASPAR [49] and UniPROBE [50] databases. ChlPsee- 
qerMotifMatch reveals that a large fraction of ETS1 peaks 
(more than 73%) contain such matches (see Figure 2B). 
The unsupervised analysis {ChlPseeqerFIRE module) for 
the distal peaks revealed that multiple motifs appear 
from the same regions, among which ETS-domain motifs 
(e.g., ELK1), but also motifs resembling binding elements 
recognized by non-ETS related factors (see Figure 2C). 
For example, the HLF motif, which is bound by the hepa- 
tic leukemia factor and has been implicated in childhood 
B-lineage acute lymphoid leukemias, was also found in 
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Table 1 The main tools of the ChlPseeqer framework 


Tool name 


Description 


GUI 
availability 


QcAnalysisTools 


Offers basic quality control tools. 


NA 


ChlPseeqerSplitReadFiles 


Splits read files (e.g., bed, eland) into one read file per chromosome. 


V 


ChlPseeqer 


Peak detection algorithm. 


V 


ChlPseeqerSummaryPromoters 


Creates a promoters-based annotation of the detected peaks (i.e., gene name-description, 
peaks) 


V 


ChlPseeqerAnnotate 


Finds the peaks distribution in the genome (e.g., exons/introns/intergenic) and creates lists of 
these peaks. 


V 


ChlPseeqerPeaksTrack 


Creates a UCSC Genome Browser track for the detected peaks. 


V 


ChlPseeqerMakeReadDensityTrack 


Creates a UCSC Genome Browser track for the reads density. 


V 


ChlPseeqerNongenicAnnotate 


Finds the peaks that over ap with repeating elements, CpG islands and segmental dup icates 


J 

V 


ChlPseeqerFIRE 


Runs FIRE for the detected peaks, in order to perform an unsupervised motif discovery. 


V 


ChlPseeqerMotifMatch 


Runs MyScanACE for the detected peaks, in order to look for specific motifs (Jaspar, Bulyk PBM 
databases). 


V 


ChlPseeqeriPAGE 


Runs PAGE for the genes associated with the detected peaks, in order to perform pathways 
ana ys is. 


V 


Ch 1 PseeqerPath wayMatch 


Looks for genes (and their corresponding peaks) that are associated to a specific pathway (e. 
g, apoptosis, GO:0060742). 


V 


ChlPseeqerCons 


Estimates the conservation scores for the detected peaks and for random intervals to allow 
comparison 


V 


Chi PseeqerDensityMatrix 


Creates a reads density matrix for a window around the TSS or the TES of the genes, or for 
any interval selected. 


NA 


ChlPseeqerReadCountMatrix 


Estimates the avg/max reads number for every input peak, across multiple ChlP-seq datasets 
and creates a peak-based reads matrix. 


NA 


ChlPseeqerCluster 


Clusters a matrix (e.g., k-means, hierarchical, SOMs) and visualizes the clustering. 


NA 


Comparelntervals 


Compares two lists or peaks and rinds the overlapping peaks and the peaks that are unique in 
each list. 


V 


CompareGenes 


Compares two lists of genes and finds the common genes and the genes that are unique in 
each list 


V 


ChlPseeqerComputeJaccard Index 


Estimates the Jaccard similarity coefficient for a set of peak files. The larger the coefficient, the 
more similarity you have between two peak files 


j 

V 


ChlPseeqerMakeGenepartsMatrix 


Creates gene-based matrices (one for promoters, one for exons, etc) for many peak files. 
Summarizes the number of peaks that fall in specific gene parts, across many different peak 
files (TFs). 


NA 


ChlPseeqerFindDistalPeaks 


Finds peaks that are away from known genes. 


NA 


ChlPseeqerFindClosestGenes 


Finds the closest gene(s) for each peak. 


NA 


ChlPseeqerGetReadCountlnPeakRegions 


Estimates the avg/max reads number for every peak, for a ChlP-seq dataset and creates a 
peak-based read matrix. 


NA 


FindPeaksWithMotif 


Extracts the peaks that have a specific FIRE motif (can be applied after running FIRE). 


NA 


MakePAGEInput 


Creates the input file for iPAGE from a list of genes. 


NA 



The table shows the names of the tools, short description of their functionality and their availability within the ChlPseeqer interface. This is not an exhaustive list; 
all available tools are documented online [37]. 



the intergenic ETS1 peaks. A motif resembling the AML- 
la/RUNXl binding sites was discovered as well using 
this de novo analysis. RUNX1 is a TF associated with sev- 
eral types of leukemia and is known to bind to T cell 
receptor enhancers [43]. The RUNX1 association with 
ETS1 distal peaks led us to look for ETS1 binding in 
putative enhancers as described in the next section. 

ETS1 binds to many putative enhancers 

To identify and characterize putative enhancers among 
the ETS1 peaks, we use the list of distal peaks obtained 
from the ChlPseeqerAnnotate analysis. To better identify 



enhancers, we also analyzed the CBP ChlP-seq dataset 
described by Hollenhorst et al. [43] (also Jurkat T cells), 
as well as the ChlP-seq histone marks datasets of primary 
CD4 + T cells described by Barski et al. [51]. CBP protein 
shares regions of very high sequence similarity with p300, 
a protein that binds to many enhancers [52]. Moreover, 
several studies [42,53,54] have suggested high levels of 
H3K4mel combined with low levels of H3K4me3 as a 
signature for predicting enhancers. Our peak detection 
algorithm identified 8,246, 41,426 and 30,797 enriched 
regions for CBP, H3K4mel and H3K4me3 datasets 
respectively. In order to locate the putative enhancers, we 
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Figure 1 Workflow use cases. Examples of workflows that can be easily generated using tools from the ChlPseeqer framework are shown. The 
starting point is always the result of peak detection: a set of enriched regions/peaks. (A) The aim of the workflow is to analyze a subset of the 
peaks that have a specific motif. From all the peaks that have the motif, we look for those that bind in the promoters of known genes. Pathways 
analysis is then performed on these genes in order to reveal enriched pathways associated with this particular subset of peaks. (B) This workflow 
allows locating and characterizing distal regulatory elements (i.e., intergenic peaks) that overlap with enhancer marks (e.g., H3K4me1 binding), in 
terms of motifs and conservation. Different workflows can be created using any combination of the ChlPseeqer tools. 



looked for the ETS1 distal peaks that have histone signa- 
ture for enhancers-presence of H3K4mel and absence of 
H3K4me3- and are also co-occupied by CBP. Using the 
ChlPseeqer Comparelntervals module, we identified such 
peaks (see Figure 3). First, we determined all ETS1 distal 
peaks (1,550) that also overlap with at least one peak in 
the H3K4mel dataset (232 peaks). Second we looked for 
peaks that have absence of H3K4me3 marks (191 peaks), 
and finally, we determined which of the remaining peaks 
(i.e., ETS1 distal peaks with H3K4mel but without 
H3K4me3 marks) overlap with at least one CBP peak 
(163). Statistical assessment of the overlaps described 
here showed that not all of them were different from 
chance expectation (data not shown). However, in what 
follows, the 163 peaks obtained from this analysis are 
considered to be putative ETSl-binding enhancers. We 



then performed unbiased motif discovery for this set of 
putative enhancers, using the ChlPseeqerFIRE module. 
This analysis revealed over-representation of two ETS 
domain-related motifs, the ELK-1 and the c-ETS motifs 
(see Figure 3) in the putative ETSl-binding enhancers 
peaks, and under-representation in the random regions. 
Finding these highly enriched motifs in such a small sub- 
set of peaks (that is -0.18% of the initial pool of peaks 
and -10% of the ETS1 distal peaks), but not in random 
regions, shows that the putative enhancers were not arbi- 
trarily identified. 

Moreover, to examine whether these regions are also 
conserved (and thus probably functional), we performed 
conservation analysis using the ChlPseeqerCons module. 
One of the capabilities of this module is to determine the 
conservation profile in and around the ChlP-seq peaks, 
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Figure 2 Analysis of the ETS1 ChlP-seq dataset. (A) The ChlPseeqerAnnotate module outputs the distribution of the ETS1 binding peaks in 
gene parts, as well as several lists of peaks that were found in a specific gene part (e.g., promoters, exons, introns). (B) The occurrence of specific 
motifs among the ETS1 peaks is shown, after using ChlPseeqerMotifMatch. The underlined motifs represent transcription factors of the ETS 
domain. (C) Unsupervised motif discovery, using ChlPseeqerFIRE, reveals multiple motifs that derive from the same regions. The fraction of ETS1 
peaks containing at least one instance of each motif is given, with the expected frequency of the motif in the random regions given in the 
parentheses. 
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Figure 3 Identification of putative enhancers. This workflow shows the identification of putative enhancers, by progressively filtering the 
distal peaks with histone modification enhancer marks (i.e., presence of H3K4me1 and absence of H3K4me3) and CBP binding. De novo motif 
discovery and conservation analysis were then performed, which showed highly enriched ETS-domain motifs and high conservation scores in 
the set of putative enhancers compared to random regions. 
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using the phastCons scores [55]. The conservation pro- 
files in randomly selected regions were also determined, 
in order to compare the level of conservation between 
the input peaks and randomly generated genomic 
regions. After averaging the conservation profiles in the 
two groups (i.e., peaks and random regions), higher level 
of conservation was indeed noticed for the group of puta- 
tive enhancers, as can be seen in Figure 3. In this exam- 
ple, we showed how the identification of putative 
enhancers could be performed in ChlPseeqer, starting 
from the distal peaks and progressively filtering them 
with histone modification enhancer marks and CBP bind- 
ing. De novo motif discovery and conservation analysis 
were then performed, which revealed highly enriched 
ETS-domain motifs and high conservation scores for the 
set of putative enhancers compared to random genomic 
regions. To further investigate the biology of ETS1 bind- 
ing in Jurkat T cells, we also looked for potential biologi- 
cal pathway differences between the genes associated 
with ETS1 binding peaks in promoters (6,053 peaks) and 
with ETS1 binding peaks in intergenic regions (1,550 
peaks). We used the ChlPseeqeriPAGE module, in order 
to find pathways that can discriminate the two groups of 
genes, (i.e., pathways enriched in one group but not in 
the other). We used the two lists of genes as input, one 
list for each group. Using the Gene Ontology annotation, 
we noticed a higher enrichment of T and B cell related 
pathways [43] in the distal peaks group, such as leukocyte 
differentiation, lymphocyte activation, immune response, 
immune system development and others (Table 2), rather 
than in the promoter peaks group. We also observed 
similar results using SignatureDB [56], a database of gene 
expression signatures mainly derived from B and T cells. 
In particular, we found a significantly higher enrichment 
of many T cell-related pathways and gene sets in the dis- 
tal peaks compared to the promoter peaks (Table 2), 
such as the signatures "Tcell_PIind_CsAdown4x" [57] 
and "Thymic_SP_CD4+Tcell_gt_Blood_CD4+TceH" [58]. 

The former signature originates from a study focusing 
on the signalling pathways network downstream of the 
T cell receptor [57], explaining the gene expression 
changes during T cell activation, whereas the latter sig- 
nature comes from the analysis of phenotypic and func- 
tional parameters of T cell differentiation stages by 
studying human thymocytes, an important organ of T 
cell production [58]. On the other hand, the promoter 
peaks group was highly associated with more general 
pathways and gene sets, such as RNA processing, RNA 
splicing, metabolic process, proliferation and others 
(Table 2). These results are consistent with previous 
findings [43], where ETS1 bound intergenic regions 
were associated with genes involved in T cell specific 
functions, while ETS1 occupancy in promoters occurred 
at genes related to housekeeping functions. Using 



CompareGenes, a tool in our framework that allows 
comparisons on the gene level, we also looked for genes 
that have ETS1 peaks in their promoters (6,053 promo- 
ter peaks) and in intergenic regions (163 putative 
enhancer peaks). This analysis gave us 39 genes (see 
Table 3). One hypothesis that can be formed is that 
binding of ETS1 at both promoters and enhancers of 
these 39 genes mediates looping of the distal elements 
onto proximal promoters. Thus, these genes may be 
regulated by ETS1 through a chromatin looping event. 
This prediction can be further tested using chromosome 
conformation capture based techniques [59-61]. In sum- 
mary, using the ChlPseeqer framework on a published 
ETS1 ChlP-seq dataset we showed that: 

• There is a large occupancy of ETS1 peaks at the 
promoters but also at intergenic regions. 

• ETS1 regions are bound by multiple motifs, either 
from the ETS-domain or non-ETSrelated (e.g., HLF, 
RUNX1). 

• Specific pathways are preferentially related to genes 
with ETS1 binding in promoters or intergenic regions. 

• It is straightforward to characterize ETS1 binding to 
genomic regions with enhancers signatures (e.g., 
H3K4mel+/H3K4me3-). 

• It is possible to determine a list of genes that may be 
regulated by an enhancer-bound 

TF, through a chromatin-looping event. Although 
these analyses could have been performed by combining 
several published tools or custom scripts, in ChlPseeqer 
this is fast (see Performance Evaluation section in Addi- 
tional file 1) and straightforward and does not requiring 
any programming knowledge. Thus, by creating custom 
workflows that combine powerful computational pro- 
grams, ChlPseeqer facilitates the comprehensive and in- 
depth analysis of ChlP-seq datasets. 

ChlPseeqer modules 

This section demonstrates the diversity and versatility of 
the framework by presenting basic ChlPseeqer modules 
in further detail. A more exhaustive description of these 
tools is available online [37]. 

ChlPseeqerAnnotate: Gene-level annotation of peaks 

This module associates a set of ChlP-seq peaks, given as 
input, with the closest genes in the genome, and deter- 
mines where each peak is located in these genes. In parti- 
cular, ChlPseeqerAnnotate classifies input peaks in 
categories, (i.e., promoter, distal, intergenic, intronic, exo- 
nic and downstream). This module generates lists of peaks 
found in each of these classes of genomic regions. This 
analysis is controlled by default or user-defined parameters 
controlled by the user, such as the promoter window 
around the TSS and the annotation database. For example, 
promoters are defined by default as 4kb-long regions, 
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Table 2 Pathways analysis between the ETS1 distal and promoter peaks 







Distal Peaks 


Promoter peaks 


T/B cell related 


Gene Ontology 








Leukocyte differentiation, GO:0002521 


p < 0.001 


P < 1 




Lymphocyte activation, GQ0046649 


p < 0.001 


P < 1 




Hemopoiesis, GO:0030097 


p < 0.001 


P < 1 




Hemopoietic or lymphoid organ development, GO:0048S34 


p < 0.001 


P < 1 




Immune response, GO:0006955 


p < 0.01 


P < 1 




Immune system development, GO:0002520 


p < 0.01 


P < 1 




B cell proliferation, GO:0042100 


p < 0.01 


P < 1 




B cell activation, GO:00421 13 


p < 0.001 


P < 1 


Others 


Biopolymer catabolic process, GO:0043285 


P < 1 


p < 1e-29 




RNA splicing, GO:0008380 


P < 1 


p < 1e-50 




DNA metabolic process, GO:0006259 


P< 1 


p < 1e-29 


T/B cell related 


SignatureDB 








Tcell_Plind_CalciumDefPtdown4x_Feske_Fig6 


p < 1e-05 


P < 1 




CD40_upregulated_Burkitt_lymphoma 


p < 0.001 


P < 1 




CD40_downregulated_Burkitt_lymphoma 


p < 0.01 


P < 1 




Pax5_repressed 


p < 0.01 


P < 1 




Tcell_Plind4x_Feske_Fig6 


p < 1e-08 


P < 1 




Tcell_Plind_CsAdown4x 


p < 1e-04 


P < 1 


Others 


RibosomaLprotein 


p< 1 


p < 1e-06 




Myeloma_PR_subgroup_up 


p< 1 


p < 1e-05 


The table shows some of the pathways and lymphoma-related signatures that were found enriched in the distal peaks and the promoter peaks groups. The 
distal peaks group was highly associated with T cell and B cell related ontologies and signatures, while for the promoter peaks group more general and 
housekeeping categories were enriched. The Gene Ontology and the SignatureDB gene expression signatures were used for this analysis (ChlPseeqeriPAGE 
module). The p-values for each pathway for both groups are also shown. 



around the TSS but not extended further than the down- 
stream extremity of the genes. Moreover, ChlPseeqerAnno- 
tate reports peaks overlapping with the first intron, since it 
has been reported that some first introns play a vital role 
in transcriptional control and splicing [62]. The tool also 
determines peaks not overlapping with any gene part, but 
found to be at least at a user-defined distance away from 
known genes (default is set to 2 kb away). We call these 
peaks distal or intergenic, because they occur in intergenic 
regions (known to contain important regulatory elements 
such as enhancers and insulators). The lists of peaks gen- 
erated by ChlPseeqerAnnotate can be directly used in 
other tools within the framework to perform subsequent 
analyses. This can be useful to focus further analyses on 
specific classes of peaks (e.g. promoter peaks or intergenic 
peaks); ChlPseeqerAnnotate lets users extract these subsets 
of peaks. ChlPseeqerAnnotate has other interesting fea- 
tures. It generates a gene-based matrix that summarizes 
the number of peaks found in each gene part (e.g., promo- 
ters, exons, introns, intergenic) and can help the user iden- 
tify quickly the peak binding occurring in their gene of 
interest. We have also developed tools that merge and 
combine this matrix output, in order to extract, for exam- 
ple, the genes with both promoter and intergenic peaks. 
The expected fraction of peaks in different genomic cate- 
gories (e.g., promoters, introns, exons, intergenic) is also 



provided, based on the fraction of the genome in each of 
these regions, so that it can be compared to the observed 
fraction of the input peaks in each category. Finally, sev- 
eral widely used gene annotations are supported, such as 
the RefSeq, Ensembl, UCSCGenes, and AceView (other 
databases can also be easily added). This feature enables 
comparing the peak-gene association results between dif- 
ferent databases. 

Pathways analysis modules 

Pathways analysis can help elucidate important biological 
mechanisms associated with genome-wide binding and 
histone modification patterns, and can be performed 
after peaks have been associated with genes, as described 
in ChlPseeqerAnnotate. We have integrated two pathways 
analysis modes in the framework that involve: (1) looking 
for a given pathway of interest within the genes asso- 
ciated with input peaks and (2) looking for any pathways 
that are enriched in these genes. Pathways annotations 
are obtained from the Gene Ontology [63], KEGG data- 
base [64], Biocarta pathways [65], the SignatureDB online 
resource [56], and the Reactome pathways [66]. Impor- 
tantly, both modes generate lists of peaks associated with 
genes in the query pathway or in the enriched ones. 
These peaks can then be used as input to other tools in 
the framework. 
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Table 3 List of the 39 genes with both promoter and 
distal ETS1 peaks 

# Gene ID Gene Description 

AKAP11 A-kinase anchor protein 11 2 
AKR1A1 alcohol dehydrogenase 3 

ATP50 ATP synthase subunit 0, mitochondrial precursor 4 

CI orfl 09 hypothetical protein LOC54955 5 

C2orf29 hypothetical protein LOC55571 6 

C9orfl23 transmembrane protein C9orf123 7 

CDK9 cell division protein kinase 9 8 

CHSY1 chondroitin sulfate synthase 1 9 

CKAP2L cytoskeleton-associated protein 2-like 10 

CLINT1 clathrin interactor 1 1 1 

DUSP2 dual specificity protein phosphatase 2 12 

DUSP6 dual specificity protein phosphatase 6 isoform 13 

HSPC157 hypothetical LOC29092 14 

KIAA0427 CBP80/20-dependent translation initiation factor 15 
LDHA L-lactate dehydrogenase A chain isoform 5 16 
LOC100188949 hypothetical LOC100188949 17 
LOC285456 hypothetical LOC285456 1 8 
LSM14B protein LSM14 homolog B 19 
MAX protein max isoform a 20 

MRPS18A 28S ribosomal protein S18a, mitochondrial 21 

MTF2 metal-response element-binding transcription 22 

NAIF1 nuclear apoptosis-inducing factor 1 23 

NDUFA10 NADH dehydrogenase [ubiquinone] 1 alpha 24 

POMP proteasome maturation protein 25 

PSMA6 proteasome subunit alpha type-6 26 

RBM16 putative RNA-binding protein 16 27 

RBM38 RNA-binding protein 38 isoform a 28 

RPN1 dolichyl-diphosphooligosaccharide-protein 29 

SEPHS2 selenide, water dikinase 2 30 

SIRPG signal-regulatory protein gamma isoform 1 31 

SPRED2 sprouty-related, EVH1 domain-containing protein 32 

TFRC transferrin receptor protein 1 33 

TMEM18 transmembrane protein 18 34 

TRIP13 thyroid receptor-interacting protein 13 isoform 35 

TXN2 thioredoxin, mitochondrial precursor 36 

UBE2D2 ubiquitin-conjugating enzyme E2 D2 isoform 1 37 

ZFAT zinc finger protein ZFAT isoform 1 38 

ZNF212 zinc finger protein 212 39 

ZNF683 zinc finger protein 683 

The table shows the 39 genes that were found to have both promoter and 
intergenic ETS1 peaks. It is possible that ETS1 binding at the promoters and 
enhancers of these genes is explained by looping of the distal elements onto 
proximal promoters. This hypothesis could be tested using chromosome 
conformation capture based techniques. 

1. ChlPseeqerPathwayMatch: User-specified pathway 
analysis 

When using this tool, the user first selects which of the 
input peaks to include in the analysis, based on their 
association with genes. For example, only peaks that 
belong to promoter regions can be included (Figure 4A). 
Alternatively, all peaks can be included, irrespective of 



where they reside within gene regions. Then, the user 
either selects the desired pathway from a list of available 
pathways in the provided pathway database (e.g Gene 
Ontology), or directly enters a pathway name (e.g., 
apoptosis, GO:0060742). ChlPseeqerPathwayMatch then 
finds all genes that belong to the selected pathway and 
outputs the corresponding ChlP-seq peaks. These peaks 
can then be used as input to other modules (e.g., for 
regulatory elements analysis). The hypergeometric distri- 
bution is used to assess the statistical significance of the 
pathway association: it determines whether the input 
peaks are associated with more genes in the query path- 
way than expected by chance (see Additional file 1). 
2. ChlPseeqeriPAGE: Pathways analysis using iPAGE 
In order to discover highly enriched pathways in the 
genes associated with input peaks, we use iPAGE [67], 
an information-theoretic pathway analysis framework. In 
iPAGE, sets of genes are used as input, and pathways 
that are enriched in each set are reported [67]. ChlPsee- 
qeriPAGE, is the module that integrates iPAGE within 
user-defined workflows. As with the previous module, 
users choose which input peaks to include in the analy- 
sis, based on their association with genes. The program 
then outputs the lists of peaks associated with specific 
enriched pathways. 

Regulatory element analysis modules 

Regulatory element analysis of ChlP-seq peaks can dis- 
cover the DNA sequence motifs bound by the TF 
assayed by ChlP-seq, and/ or to find sequences bound by 
its co-factors. We have integrated two regulatory ele- 
ment analysis modes in the framework: (1) analysis 
based on known motifs or user-defined motif patterns, 
and (2) de novo motif analysis. Both analyses require 
extracting DNA sequences under the peaks from the 
genome reference sequence. Efficient extraction is per- 
formed by pre-indexing genomes using the SAMTools 
C library [68]. 

7. ChlPseeqerMotifMatch: User-specified regulatory element 
analysis 

Several software tools support searching for peaks that 
match a specific motif, but they often have limitations 
that restrict their usability. For example, in Cistrome 
[20] it is not straightforward to look for a specific 
motif, since available motifs are not shown to the user 
(only the available motif databases are). In HOMER 
[30], only motifs previously detected by the software 
are available for searching; the integration with public 
and popular sequence motif databases such as JASPAR 
that would enlarge the pool of available motifs is lim- 
ited. ChlPseeqerMotifMatch seeks to overcome some of 
these limitations. To perform known motif analysis in 
ChlPseeqerMotifMatch, the user either selects the 
desired motif from a compiled dataset of ~250 TF 
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Figure 4 ChlPseeqer graphical interface. (A) The users can control all parameters of the tools. For example, in the Find Pathway tool (the GUI 
version of ChlPseeqerPathwayMatch) the user can select: the input peaks, the species of their data, the gene annotation database used to extract 
the genes related to the input peaks, which subset of the peaks to include in the analysis (e.g., promoter peaks, intergenic peaks), and which 
pathways database to use in order to look for the pathway. The desired pathway can be either selected from a list of available pathways or 
typed by the user (e.g., apoptosis, development). (B) The typical output of each tool is a table summarizing all peaks resulting from the analysis, 
as well as basic statistics (e.g., how many peaks found). Here, the peaks that contain the TCCTAGA motif are shown, after using the Find Motif in 
peaks tool (the GUI version of ChlPseeqerMotil 'Match). (C) Several tools also provide graphical output. For example, the summary result of iPAGE 
tool (the GUI version of ChlPseeqeriPAGE) is a pathway enrichment table showing the level of enrichment for all pathways found in the genes 
related to the input peaks (category 1), compared to the genes used as background (category 0). (D) The output of the Similarity coefficient tool 
(the GUI version of ChlPseeqerComputeJaccardlndex) is a color-coded matrix, showing the pairs of datasets that have more common peaks than 
others, with darker red color. 



binding sites (defined as position-specific weight 
matrices) from JASPAR [49] and UniPROBE [50] data- 
bases, or provides a consensus sequence in the form of 
regular expressions (e.g., TCCAAT, [AT]CG[CT]). In 
the former case, peak regions are scanned using the 
Berg and Von Hippel method [69] and a user-defined 
affinity threshold [70], and peaks containing one or 
more occurrence of the motif are given as output. 
Additional information such as motif positions within 
the peak and orientation are also reported. In the latter 
case, user-specific consensus sequences are used 
instead of weight matrices, and peak regions are 



scanned using regular expression matching algorithms 
from the pcre C library [71]. 

2. ChlPseeqerFIRE: De novo regulatory element analysis 

De novo motif analysis is performed using FIRE, an infor- 
mation-theoretic methodology for identification and 
characterization of regulatory elements [72]. In order to 
search for any informative motifs that are highly enriched 
within the detected ChlP-seq peaks, background 
sequences are first created. These background sequences 
can be extracted either randomly across the entire gen- 
ome (option "random"), or immediately adjacent to the 
peak regions (option "adjacent"). They can also be 
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extracted so as to preserve the C+G and CpG content of 
the input sequences using two different options. The 
"CGI" option estimates the fraction of the original peaks 
that overlap with CpG islands, and then produces ran- 
dom background regions that maintain this fraction of 
CpG islands overlap. Alternatively, using the "1MM" 
option, the program calculates for each input peak 
sequence 1st order Markov frequencies and uses these 
frequencies to generate new random sequences. As 
shown in Additional File 1 Figure S10, both CGI and 
1MM preserve C+G and CpG frequencies of the input 
peaks. The option that the users should use depends on 
the question they want to address (e.g., Are there are any 
DNA motifs enriched in ChlP-seq peaks compared to 
regions flanking these peaks?, Are there any DNA motifs 
enriched in ChlP-seq peaks compared to randomly 
selected genomic regions with similar lengths and 
nucleotide compositions?). After identification of the 
motifs that best explain the distinction between peak 
regions and background sequences, peak lists containing 
each motif are extracted and can be used as input to 
other tools in the framework, such as pathway analysis 
tools. 

ChlPseeqerNongenicAnnotate: Nongenic peak annotation 

The ability to integrate the results of a ChlP-seq study 
with existing and publicly available ChlP-seq is an impor- 
tant. For example, this integration could suggest tran- 
scription factor, co-factors or histone modification that 
should be further explored because of their extensive 
overlap with a set of ChlP-seq peaks. In the ChlPseeqer 
framework, the ChlPseeqerNongenicAnnotate provides 
such capabilities; it can determine the subset of input 
peaks that overlap with peaks obtained from TF or his- 
tone modification ChlP-seq datasets of the ENCODE 
project [73,74], as well as the statistical significance of 
this overlap (ENCODE datasets are subject to the 
ENCODE data usage policy available at http://genome. 
ucsc.edu/ENCODE/terms.html). ChlPseeqerNongenicAn- 
notate can perform additional integrative analyses. For 
example, extensive literature has shown that TF binding 
sites and specific histone modifications can be associated 
with repeated elements [75] and other nongenic ele- 
ments, such as CpG islands [76]. Filtering the peaks 
based on that type of features could reveal interesting 
groups of peaks that have the potential to alter and 
impact gene expression (e.g., possible promoters, retroe- 
lements that impact transcriptional networks [75] and 
more). Using the ChlPseeqerNongenicAnnotate module 
and track-based data from the UCSC Genome Browser, 
users can quickly and easily determine which of their 
input ChlP-seq peaks overlap with: (1) known repeated 
sequences (identified by RepeatMasker [77]), (2) CpG 
islands and (3) segmental duplications. While such 



comparisons can be performed via the UCSC Table 
Browser (i.e., use intersection between any two tracks), 
ChlPseeqerNongenicAnnotate facilitates these analyses 
and allows their integration with other analyses within 
the framework. 

ChlPseeqerCons: Conservation analysis 

Cross-species conservation analysis is necessary in order 
to discover functional genomic elements (e.g., distal regu- 
latory elements) and also to prioritize the most promising 
genomic elements for experimental validation. For these 
reasons, we have developed ChlPseeqerCons, a tool that 
estimates the conservation for a given set of peaks and 
outputs the peaks whose average conservation score is 
greater than a user-defined threshold (default is set to 

0. 5). The most useful aspect of this module, is estimating 
the conservation level of sequences adjacent to the input 
peaks, or of randomly selected sequences, thus allowing 
global assessment of peak conservation. 

Another interesting feature of ChlPseeqerCons is produ- 
cing conservation profiles for regions around the summit 
of the peaks (default is 2kb-long regions), and for random 
intervals: the average conservation score is estimated for 
every «-sized bins of the regions (default is n = 10 nucleo- 
tides). By plotting the resulting conservation profiles, we 
can easily compare the level of conservation between the 
input peaks and randomly generated genomic regions (see 
Figure 3). ChlPseeqerCons uses the phastCons [55] or phy- 
loP [78] scores (freely available as tracks from the UCSC 
Genome Browser website), calculated from placental 
mammalian genomes or primates. 

Analysis of read density profiles 

The analysis of read density profiles, when combined with 
clustering methods, can help identify groups of genes with 
similar binding profiles in their promoters, or groups of 
peaks that tend to have similar histone modification or TF 
binding patterns. In ChlPseeqer, we have developed tools 
that take as input a set of genomic regions and: (1) calcu- 
late the read density profile of the regions (split regions 
into bins and calculate the average read count within each 
bin), (2) count the maximum or average number of ChlP- 
seq reads for each genomic region. These tools can also 
perform RPKM-style read count normalization [79] prior 
to read counting, in order to compare multiple experi- 
ments with different numbers of short reads. 

1. ChlPseeqerDensityMatrix: Read density matrix 
ChlPseeqerDensityMatrix lets the users explore and ana- 
lyse the average read density profiles, either for user- 
defined regions around the TSS or TES of the genes 
(default is set to 4 kb around the TSS), or around the 
summit of the given peaks (default is set to 2 kb window 
centred to the peak summit). For each region, bins of n 
nucleotides (default is 10 nucleotides) are created and the 
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average number of reads for each bin is counted. For 
example, if 4 kb regions are extracted around the TSS of 
genes, the result of this module will be a matrix that con- 
tains for each promoter the average number of reads for 
400 bins of 10 nucleotides each. Clustering the promoters 
of the resulting matrix is then performed based on 
their read density profiles, using either the built-in Self- 
Organizing Map algorithm [80], or by interfacing with 
Cluster 3.0 software [81,82] (ChlPseeqerCluster). The 
results can be directly visualized using Postscript/PDF 
heatmaps produced using our built-in visualization tools 
or using TreeView [83] (included in the framework). 
Lists of genomic regions for each cluster can be exported 
and then used as input into other tools, in order to 
answer questions such as "Are there any regulatory ele- 
ments associated with a given promoter binding pattern?" 
and "Which pathways discriminate between promoter 
binding patterns?". 

2. ChlPseeqerReadCountMatrix: Read count for multiple 
ChlP-seq experiments 

ChlPseeqerReadCountMatrix estimates for each of the 
input peaks the maximum or average number of reads 
for multiple ChlP-seq datasets. The result of this module 
is a matrix that contains, for each of the input peaks, the 
maximum or average reads count for every ChlP-seq 
experiment. Similarly, the ChlPseeqerCluster module can 
be used to cluster the peaks of this matrix based on the 
reads number across multiple datasets, in order to reveal 
groups of peaks that share common binding in several 
TFs or histone modifications. The clusters of peaks can 
be extracted and used as input into other tools. 

Integration and comparison of ChlP-seq experiments 

As more and more ChlP-seq datasets become publicly 
available, the need for data integration and comparison is 
becoming essential. Such integration can reveal how dif- 
ferent TFs cooperate to regulate gene expression [84], as 
well as the interplay between TF binding and histone 
modifications [85,86]. The integration between ChlP-seq 
datasets can be realized by determining the overlap 
between sets of peaks. In ChlPseeqer, we have addressed 
this need by implementing fast interval tree-based tools 
for comparing ChlP-seq experiments at the peak level 
(Comparelntervals). These tools can be used to compare 
sets of peaks, and quickly: (1) identify overlapping peaks, 
(2) merge sets of peaks, or (3) determine peaks in the 
first set that do not overlap with any peaks from the sec- 
ond set (i.e. find unique peaks). Moreover, as described 
in previous section, these tools can assess the significance 
of the overlap between two sets of peaks using randomi- 
zation tests that take into account the genomic distribu- 
tion of peaks. In addition to simply counting how many 
peaks overlap between two peak lists, we provide tools 
that can also quantify the extent to which two sets of 



peaks overlap by estimating the pairwise Jaccard similar- 
ity coefficient between pairs of ChlP-seq datasets (ChlP- 
seeqerComputeJaccardlndex). The Jaccard index is 
estimated as the number of peaks that overlap between 
two peak files, divided by the union of the two files. The 
larger the coefficient, the more similar two datasets are 
in terms of overlapping peaks (see Figure 4D). Such com- 
parisons can also be performed at the gene level; we have 
developed similar tools for gene-based comparisons 
(CompareGenes) that can be easily used on the genes- 
based output of ChlPseeqerAnnotate. Finally, the annota- 
tion of peaks against a collection of ENCODE ChlP-seq 
datasets {ChIP seeqerNongenic Annotate) , as well as the 
read density analysis across multiple datasets {ChlPsee- 
qerReadCountMatrix), both described in previous para- 
graphs, were also developed in the context of integration 
and comparison of multiple ChlP-seq experiments. 

Visualization tools 

Visualization is tightly integrated to all modules of the fra- 
mework in order to facilitate ChlP-seq data exploration 
and summarize the results of each analysis. ChlPseeqer 
includes tools for creating UCSC Genome Browser tracks 
representing peak location and genome-wide read densi- 
ties. It also includes tools for drawing pie charts summar- 
izing the genomic distribution of the peaks, creating motif 
and pathway enrichment tables and conservation plots 
(see Figure 1, Figure 4C). The output of clustering read 
density profiles can be visualized either using heatmaps, or 
2D Kohonen maps [80]. Finally, we provide tools (ChlPsee- 
qerPlotAverageReadDensitylnGenes, ChlPseeqerPlotAvera- 
gePeaksNumberlnGenes) for the visualization of reads 
density and peaks number in gene parts (e.g., promoters, 
exons, introns). The description of these tools and exam- 
ples of the visualization they provide can be found at the 
ChlPseeqer tutorial [37]. 

Discussion 

The ChlPseeqer framework can considerably facilitate 
the bioinformatics analysis of ChlP-seq data by providing 
an integrated suite of computational tools that are fast, 
easy to use (no programming experience required), and 
can be combined with each other. The variety of tools 
and the flexibility offered by their parameters (Figure 4A) 
makes it possible to address most biological questions 
that are often raised when analyzing ChlPseq datasets. 
Notably, as demonstrated before, ChlPseeqer users can 
create personalized workflows in order to perform speci- 
fic but sophisticated analysis, often requiring integration 
of multiple datasets (Figure 4D). 

ChlPseeqer is a continuously developing project and 
we are actively working on implementing several addi- 
tional components. For example, more species will be 
supported soon (e.g., C. elegans, zebrafish, chicken, rat), 
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as well as more visualization components. Moreover, 
facilitating the integration of ChlP-seq data with gene 
expression data, obtained from microarray or RNA- 
sequencing (RNA-seq) experiments, also represents an 
important avenue for future improvement of the 
framework. 

Conclusions 

In order to fill the gap between the identification of 
ChlP-seq peaks and the biological interpretation of the 
data, we have developed ChlPseeqer, a comprehensive 
computational framework that can be adapted to a 
user's needs and to the hypotheses of a ChlP-seq study. 
We showed that using the ChlPseeqer framework we 
can perform sophisticated analyses of ChlP-seq datasets 
(e.g., compare and integrate peak/gene lists), explore the 
data from multiple perspectives (e.g., conservation, 
motifs occurrence, pathways enrichment), and address 
specific biological questions, such as "How do promoter 
peaks differ from distal peaks?" , "Are there genes with 
both promoter and enhancer peaks?" . We believe that 
this framework will be of great assistance to investiga- 
tors who wish to perform high-level analysis of genome- 
wide ChlP-seq datasets, but do not yet possess advanced 
computer programming skills. 

Availability and requirements 

ChlPseeqer is freely available and can be downloaded at 
http:// physiology.med.cornell.edu/ faculty/ elemento/lab/ 
CS_files/ChIPseeqer-2.0.tar.gz The system requirements, 
instructions on how to install and run the software, and 
a detailed tutorial are also provided at http://physiology. 
med.cornell.edu/faculty/ elemento/lab/ chipseq.shtml The 
ETS1 and CBPdatasets used in this paper can be found 
at the GEO (GSE17954), and the H3K4mel and 
H3K4me3 datasets are available at http://dir.nhlbi.nih. 
gov/papers/lmi/epigenomes/hgtcell.aspx 

Additional material 



Additional file 1: This file describes in detail the quality control 
analysis tools and the peak detection algorithm, implemented 
within the ChlPseeqer framework, as well as performance 
evaluation results for several tools of the framework 
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